{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(iteration)=\n",
    "# Iteration\n",
    "\n",
    "## Introduction\n",
    "\n",
    "\n",
    "In {ref}`functions`, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Reducing code duplication has three main benefits:\n",
    "\n",
    "1.  It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.\n",
    "\n",
    "2.  It's easier to respond to changes in requirements.\n",
    "    As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.\n",
    "\n",
    "3.  You're likely to have fewer bugs because each line of code is used in more places.\n",
    "\n",
    "One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extract them out into independent pieces that can be easily reused and updated. Another tool for reducing duplication is *iteration*, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.\n",
    "\n",
    "In this chapter you'll learn about iteration in three ways: explicit iteration, using for loops and while loops; iteration via comprehensions (eg list comprehensions); and iteration for **pandas** data frames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a55374",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove cell\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib_inline.backend_inline\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
    "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17575f3a",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "This chapter will use the **pandas** data analysis package."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19e65736",
   "metadata": {},
   "source": [
    "## For Loops\n",
    "\n",
    "A loop is a way of executing a similar piece of code over and over in a similar way.\n",
    "\n",
    "A `for` loop does something *for* the time that the condition is satisfied. For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2bbd41c",
   "metadata": {},
   "outputs": [],
   "source": [
    "name_list = [\"Lovelace\", \"Smith\", \"Pigou\", \"Babbage\"]\n",
    "\n",
    "for name in name_list:\n",
    "    print(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb31bf67",
   "metadata": {},
   "source": [
    "prints out a name until all names have been printed out.\n",
    "\n",
    "Every for loop has three components:\n",
    "\n",
    "1.  The *output*, here a print statement. But you can imagine a for loop that populates each entry of a data frame or list (but you should always create the full Python object first and populate it later rather than changing its size within the loop because the latter is slow).\n",
    "\n",
    "\n",
    "2.  The *sequence*: `for name in name_list:`.\n",
    "    This determines what to loop over: each run of the for loop will assign `name` to a different value from the iterable `name_list`. It doesn't have to be a list, any iterable object will do.\n",
    "    It's useful to think of `name` above as a pronoun, like \"it\".\n",
    "\n",
    "3.  The *body*: `print(name)`.\n",
    "    This is the code that does the work.\n",
    "    It's run repeatedly, each time with a different value for `name`.\n",
    "    The first iteration will effectively run `print(name_list[0])`, the second will run `print(name_list[1])`, and so on.\n",
    "\n",
    "\n",
    "As long as your object is an iterable (ie you can iterate over it), then it can be used in this way in a for loop. The most common examples are lists and tuples, but you can also iterate over strings (in which case each character is selected in turn). One gotcha to be aware of is if you iterate over a string, say \"hello\", instead of iterating over a *list (or tuple) of strings*, eg `[\"hello\"]`. In the latter case, you get:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "835ebda7",
   "metadata": {},
   "outputs": [],
   "source": [
    "for entry in [\"hello\"]:\n",
    "    print(entry)\n",
    "    print(\"---end entry---\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a6dec3e",
   "metadata": {},
   "source": [
    "While in the former you get something quite different and typically not all that useful:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a19ac2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "for entry in \"hello\":\n",
    "    print(entry)\n",
    "    print(\"---end entry---\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a080093",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Write a for loop that prints out \"Python for Data Science\" so that each word is printed in a successive iteration.\n",
    "```\n",
    "\n",
    "A useful trick with for loops is the `enumerate()` keyword, which runs through an index that keeps track of the place of items in a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "239e133f",
   "metadata": {},
   "outputs": [],
   "source": [
    "name_list = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n",
    "\n",
    "for i, name in enumerate(name_list):\n",
    "    print(f\"The name in position {i} is {name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3547896",
   "metadata": {},
   "source": [
    "Remember, Python indexes from 0 so the first entry of `i` will be zero. But, if you'd like to index from a different number, you can:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b66c5c53",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, name in enumerate(name_list, start=1):\n",
    "    print(f\"The name in position {i} is {name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8c12f6d",
   "metadata": {},
   "source": [
    "Another useful pattern when doing for loops with dictionaries is iteration over key, value pairs. We'll get to learn more about dictionaries very shortly, but for now what's important is that they map a key to a value, for example \"apple\" might map to \"fruit\". Let's take our example from earlier that mapped cities to temperatures. If we wanted to iterate over *both* keys and values, we can write a for loop like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "010239fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "cities_to_temps = {\"Paris\": 28, \"London\": 22, \"Seville\": 36, \"Wellesley\": 29}\n",
    "\n",
    "for key, value in cities_to_temps.items():\n",
    "    print(f\"In {key}, the temperature is {value} degrees C today.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42aef792",
   "metadata": {},
   "source": [
    "Note that we added `.items()` to the end of the dictionary. And note that we didn't *have* to call the key `key`, or the value `value`: these are set by their position. But part of best practice in writing code is that *there should be no surprises*, and writing key, value makes it really clear that you're using values from a dictionary.\n",
    "\n",
    "```{admonition} Exercise\n",
    "Write a dictionary that maps four cities you know into their respective countries and print the results using the `key, value` iteration trick.\n",
    "```\n",
    "\n",
    "Another useful type of for loop is provided by the `zip()` function. You can think of the `zip()` function as being like a zipper, bringing elements from two different iterators together in turn. Here's an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ea3efc5",
   "metadata": {},
   "outputs": [],
   "source": [
    "first_names = [\"Ada\", \"Adam\", \"Grace\", \"Charles\"]\n",
    "last_names = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n",
    "\n",
    "for forename, surname in zip(first_names, last_names):\n",
    "    print(f\"{forename} {surname}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c07737f3",
   "metadata": {},
   "source": [
    "The zip function is super useful in practice.\n",
    "\n",
    "```{admonition} Exercise\n",
    "Zip together the first names from above with this jumbled list of surnames: `['Babbage', 'Hopper', 'Smith', 'Lovelace']`.\n",
    "\n",
    "(Hint: you have seen a trick to help re-arrange lists earlier on in the Chapter.)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba79c3dc",
   "metadata": {},
   "source": [
    "## List (and Other) Comprehensions\n",
    "\n",
    "There's a second way to do loops in Python and, in most but [not all](https://towardsdatascience.com/list-comprehensions-vs-for-loops-it-is-not-what-you-think-34071d4d8207) [cases](https://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops), they run faster. More importantly, and *this* is the reason it's good practice to use them where possible, they are very readable. They are called *list comprehensions*.\n",
    "\n",
    "List comprehensions can combine what a `for` loop and (if needed) what a `condition` do in a single line of code. First, let's look at a `for` loop that adds one to each value done as a list comprehension (NB: in practice, we would use super-fast **numpy** arrays for this kind of operation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7efed381",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_list = range(50, 60)\n",
    "[1 + num for num in num_list]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaca4b6b",
   "metadata": {},
   "source": [
    "The general pattern is a bit similar to with the `for` loop but there are some differences. There's no colon, and no indenting. The syntax is \"do something with `x`\" then `for x in iterable`. Finally, the expression is wrapped in a `[` and `]` to make the output a list.\n",
    "\n",
    "Note that lists are not the only wrapping you can provide to this kind of structure. A `(` and `)` to make it a generator (don't worry about what this is for now), a `{` and `}` to make it a set (an object that only contains unique values), or it's possible to create a dictionary from a comprehension too! List comprehensions are the most common, so if you only remember one kind, remember them.\n",
    "\n",
    "```{admonition} Exercise\n",
    "Create a list comprehension that multiplies numbers in the range from 1 to 10 by 5.\n",
    "\n",
    "Did you get the range right?\n",
    "```\n",
    "\n",
    "Let's now see how to include a condition within a list comprehension. Say we had a list of numbers and wanted to filter it according to whether the numbers divided by 3 or not using the modulo operator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "722fda21",
   "metadata": {},
   "outputs": [],
   "source": [
    "number_list = range(1, 40)\n",
    "divide_list = [x for x in number_list if x % 3 == 0]\n",
    "print(divide_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "326a9c20",
   "metadata": {},
   "source": [
    "The syntax here is do something to `x` for `x` in something if `x` satisfies some condition.\n",
    "\n",
    "Here's another example that picks out only the names that include 'Smith' in them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6e80d6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "names_list = [\"Joe Bloggs\", \"Adam Smith\", \"Sandra Noone\", \"leonara smith\"]\n",
    "smith_list = [x for x in names_list if \"smith\" in x.lower()]\n",
    "print(smith_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f67de061",
   "metadata": {},
   "source": [
    "Note how we used 'smith' rather than 'Smith' and then used `lower()` to ensure we matched names regardless of the case they are written in.\n",
    "\n",
    "We can even do a whole `if` ... `else` construct *inside* a list comprehension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f348bfb6",
   "metadata": {},
   "outputs": [],
   "source": [
    "names_list = [\"Joe Bloggs\", \"Adam Smith\", \"Sandra Noone\", \"leonara smith\"]\n",
    "smith_list = [x if \"smith\" in x.lower() else \"Not Smith!\" for x in names_list]\n",
    "print(smith_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72548d09",
   "metadata": {},
   "source": [
    "Many of the constructs we've seen can be combined. For instance, there is no reason why we can't have a nested or repeated list comprehension using `zip()`, and, perhaps more surprisingly, sometimes these are useful!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74e4fcc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "first_names = [\"Ada\", \"Adam\", \"Grace\", \"Charles\"]\n",
    "last_names = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n",
    "names_list = [x + \" \" + y for x, y in zip(first_names, last_names)]\n",
    "print(names_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ec0597d",
   "metadata": {},
   "source": [
    "An even more extreme use of list comprehensions can deliver nested structures:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c82cf1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "first_names = [\"Ada\", \"Adam\"]\n",
    "last_names = [\"Lovelace\", \"Smith\"]\n",
    "names_list = [[x + \" \" + y for x in first_names] for y in last_names]\n",
    "print(names_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b83ece8",
   "metadata": {},
   "source": [
    "This gives a nested structure that (in this case) iterates over `first_names` first, and then `last_names`. (Note that this object is a list of lists of strings!)\n",
    "\n",
    "Let's see a dictionary comprehension now. These look a bit similar to set comprehensions because they use `{` and `}` at either end but they are different because they come with a colon separating the keys from the values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acef16ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "{key: value for key, value in zip(first_names, last_names)}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d18be80b",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Create a nested list comprehension that results in a list of lists of strings equal to `[['a0', 'b0', 'c0'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]` (ie a combination of the first three integers and letters of the alphabet). You may find that you need to convert numbers to strings using `str(x)` to do this.\n",
    "```\n",
    "\n",
    "If you'd like to learn more about list comprehensions, check out these [short video tutorials](https://calmcode.io/comprehensions/introduction.html).\n",
    "\n",
    "## While Loops\n",
    "\n",
    "`while` loops continue to execute code until their conditional expression evaluates to `False`. (Of course, if it evaluates to `True` forever, your code will just continue to execute...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e47ba02",
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 10\n",
    "while n > 0:\n",
    "    print(n)\n",
    "    n -= 1\n",
    "\n",
    "print(\"execution complete\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0131cc60",
   "metadata": {},
   "source": [
    "NB: in case you're wondering what `-=` does, it's a compound assignment that sets the left-hand side equal to the left-hand side minus the right-hand side.\n",
    "\n",
    "You can use the keyword `break` to break out of a while loop, for example if it's reached a certain number of iterations without converging.\n",
    "\n",
    "```{admonition} Exercise\n",
    "Making use of `import string` and then `string.ascii_lowercase` to get the characters in the alphabet, write a while loop that iterates backwards through the alphabet (starting at \"z\") before printing \"execution complete\".\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ec0643e",
   "metadata": {},
   "source": [
    "## Iteration with **pandas** Data Frames\n",
    "\n",
    "For loops, while loops, and comprehensions all work on **pandas** data frames, but they are generally a bad way to get things done because they are slow and not memory efficient. To aid cases where iteration is needed, **pandas** has built-in methods for iteration depending on what you need to do.\n",
    "\n",
    "These built-in methods for iteration have an overlap with what we've seen in {ref}`data-transform` but we'll dig a little deeper into `assign()`/assignment operations, `apply()`, and `eval()` here.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "791e0437",
   "metadata": {},
   "source": [
    "### Assignment Operations and `assign`\n",
    "\n",
    "An assignment is a statement that *assigns* the value on the right to the object on the left with an equals sign in the middle.\n",
    "\n",
    "Let's imagine we have a data frame like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3116809",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame(np.random.normal(size=(6, 4)), columns=[\"a\", \"b\", \"c\", \"d\"])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e826ad5",
   "metadata": {},
   "source": [
    "**pandas** has many built-in functions that are already built to iterate over rows and columns; for example, to compute the median of rows or columns respectively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac909c2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.median(axis=\"rows\")  # can also use axis=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96426002",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.median(axis=\"columns\")  # can also use axis=0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52ad8ea5",
   "metadata": {},
   "source": [
    "In these cases, and others using built-in functions, the iteration is hidden. What if we want to do something that isn't a built in and also isn't an aggregation though? Let's take the example of adding five to every entry. We could do it by explicitly iterating row by row, then repeat that for each column, ie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "060b6815",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Do not do this!\n",
    "\n",
    "\n",
    "def add_five_slow(df):\n",
    "    for i in range(len(df)):\n",
    "        for j in range(len(df.columns)):\n",
    "            df.iloc[i, j] = df.iloc[i, j] + 5\n",
    "\n",
    "\n",
    "%timeit add_five_slow(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8246132e",
   "metadata": {},
   "source": [
    "But to do this, every individual cell must be accessed and operated on—so it is very slow, taking milliseconds. **pandas** has far faster ways of performing the same operation. For simple operations on data frames with consistent type, you can simply add five to the whole data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a48ae52",
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit df + 5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a81336b9",
   "metadata": {},
   "source": [
    "This took tens of *microseconds*, much faster."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7313616e",
   "metadata": {},
   "source": [
    "This also works on a per column basis, so you can do `df[\"a\"] = df[\"a\"] + 5` and so on.\n",
    "\n",
    "These operations have equivalents using the `assign()` operator, which allows for *method chaining*; stringing multiple operations together. The `assign()` operator version of `df[\"new_a\"] = df[\"a\"] + 5` would be"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7391dc5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.assign(new_a=lambda x: x[\"a\"] + 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76aec162",
   "metadata": {},
   "source": [
    "### Apply\n",
    "\n",
    "What happens if you have a more complicated function you want to iterate over? This is where **pandas**' `apply()` comes in, and can be used with assignment. `apply()` can also be used across rows or columns. Like `assign()`, it can be combined with a lambda function and used with either the whole data frame or just a column (in which case no need to specify `axis=`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31adcb3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.apply(lambda x: x[\"a\"] - x[\"new_a\"].mean() * x[\"c\"] / x[\"b\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78b558f4",
   "metadata": {},
   "source": [
    "Note that this is just an example: you could still do this entire operation without using apply! But you will sometimes find yourself with cases where you do need to use it.\n",
    "\n",
    "Apply also works with functions, including user-defined functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "136d435d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def complicated_function(x):\n",
    "    return x - x.mean()\n",
    "\n",
    "\n",
    "df = df.apply(complicated_function, axis=1)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "171be2c9",
   "metadata": {},
   "source": [
    "### Eval(uate)\n",
    "\n",
    "`eval()` evaluates a string describing operations on DataFrame columns to create new columns. It operates on columns only, not rows or elements. Here's an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d9defd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"ratio\"] = df.eval(\"a / new_a\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b275b5b",
   "metadata": {},
   "source": [
    "Evaluate can also be used to create new boolean columns using, for example, a string `\"a > 0.5\"` in the above example."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc"
  },
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
